Support http2 upstream phase3#32

Open
mwfj wants to merge 11 commits into main from support-http2-upstream-phase3

@mwfj mwfj commented May 13, 2026

HTTP/2 upstream Phase 3 (PR-2): wait-queue + checkout/lifecycle overhaul

Ships the wait-queue and cold-start lifecycle overhaul tracked in HTTP2_UPSTREAM_PHASE_RECONCILIATION.md §2 (PR-2). Touches partition, lease, H2 connection, and proxy transaction. Held-fallback retry on truncation (PR-1.5) and the per-(host,port,endpoint) ALPN cache remain deferred to a follow-up.

Summary

| Area | Change |
| --- | --- |
| Cold-start H2 probe | New OpenNewH2Connection + OnH2ConnectHandshakeComplete single-disposition state machine. Single outbound connect per (host, port); concurrent CheckoutAsync calls dedup onto the in-flight probe via h2_connecting_conns_.count(key). |
| ALPN resolution | prefer=auto → adopt the negotiated transport into either H2 (AdoptLease + h2_table_.Insert) or the H1 idle pool (MarkTransferred + AdoptAsH1Connection + ReclassifyH2WaitersToAny + ServiceWaitQueue). prefer=always strictly fails on non-h2 ALPN. |
| Wait queue | New WaiterKind { ANY, H2_STREAM_SLOT } discriminator. ServiceWaitQueue's idle-pop and create-new branches skip H2_STREAM_SLOT entries; admission flows through DrainH2StreamWaitersForHost after a probe resolves to h2. |
| GOAWAY handling | OnGoawayReceived with active_streams==0 now hands the dying conn to pending_destroy_h2_conns_ (deferred-destroy stash, reaped from HandleBytes post-recv flush; avoids UAF in the recv-callback chain) and fires StartH2ReplacementConnect(host, port) to keep queued waiters from timing out. |
| Retry classification | New RESULT_GOAWAY_UNPROCESSED = -12 (peer demonstrably never processed the stream; zero-delay first retry via CONNECT_FAILURE) and RESULT_GOAWAY_MAYBE_PROCESSED = -13 (standard retry). Both breaker-neutral (connection-lifecycle, not health signal). New IsH2RetryableCode / MapH2CodeToRetryCondition helpers. |
| Lease shape | UpstreamLease is now a tagged variant Kind { EMPTY, H1, H2 }. H2 ctor carries (partition_alive, conn_alive) dual tokens; accessors (GetH2Connection, GetH2StreamId, GetH2Stream) gate on both. |
| Pool table | H2ConnectionTable migrated from shared_ptr → unique_ptr ownership with explicit Extract(UpstreamH2Connection*) for MoveConnToPendingDestroy. |
| Lifecycle | New DestroyOnDispatcher 6-step ordering (alive flip → callback null → timer cleanup → session teardown → MarkClosing → lease.reset → destroyed flag) + RunDeferredEraseWalk two-phase stream-erase walker. |
| New keys | HostPortKey user-defined struct in new header (deliberately not a std::pair alias; specializing std::hash for a library type is UB per [namespace.std]). |
| Tests | ~70 new tests across h2_upstream_test.h (S6–S13, A6b, A6c, B12b update, N9o/N9p early-final-headers + shutdown-kill, B17/B18/B19 wire RST/GOAWAY) and upstream_pool_test.h (lease Kind discrimination, HostPortKey hash + equality, extended error codes). |
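
For readers unfamiliar with the [namespace.std] point in the "New keys" row, the following is a minimal sketch of a user-defined key with a std::hash specialization. The field names (host, port), the hash-mixing constant, and the CountInFlight helper are assumptions made for illustration; the actual layout in host_port_key.h may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical field layout; the real HostPortKey may carry other members.
struct HostPortKey {
    std::string host;
    std::uint16_t port;

    friend bool operator==(const HostPortKey& a, const HostPortKey& b) {
        return a.port == b.port && a.host == b.host;
    }
};

// Specializing std::hash is allowed here because HostPortKey is a
// program-defined type. Doing the same for std::pair<std::string, uint16_t>
// would specialize a standard template for a library type, which is the
// undefined behavior the PR description refers to.
namespace std {
template <>
struct hash<HostPortKey> {
    size_t operator()(const HostPortKey& k) const noexcept {
        size_t h = hash<string>{}(k.host);
        // Boost-style hash_combine mix (an illustrative choice, not the PR's).
        h ^= hash<uint16_t>{}(k.port) + 0x9e3779b9u + (h << 6) + (h >> 2);
        return h;
    }
};
}  // namespace std

// Hypothetical helper mirroring the h2_connecting_conns_.count(key) dedup
// check: returns 1 if a probe for this key is already in flight, else 0.
inline std::size_t CountInFlight(
    const std::unordered_map<HostPortKey, int>& in_flight,
    const HostPortKey& key) {
    return in_flight.count(key);
}
```

With a key like this, a second checkout for the same (host, port) can look up the in-flight map and join the existing probe instead of opening another connection.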

File-level scope

| File | Role |
| --- | --- |
| include/upstream/host_port_key.h (new) | User-defined struct + std::hash specialization |
| include/upstream/pool_partition.h, server/pool_partition.cc | OpenNewH2Connection, OnH2ConnectHandshakeComplete, AdoptAsH1Connection, ReclassifyH2WaitersToAny, StartH2ReplacementConnect, MoveConnToPendingDestroy, ReapPendingDestroyH2Conns, EnqueueH2StreamSlotWaiter, DrainH2StreamWaitersForHost, FailH2StreamSlotWaiters, WaiterKind, h2_connecting_conns_, pending_destroy_h2_conns_ |
| include/upstream/upstream_h2_connection.h, server/upstream_h2_connection.cc | partition_ back-pointer + SetPartition, MarkTransferred, transferred_ flag, DestroyOnDispatcher 6-step, RunDeferredEraseWalk, GOAWAY-idle branch, ClearH2TransportCallbacks helper, dtor safety-net |
| include/upstream/upstream_lease.h | Tagged variant Kind { EMPTY, H1, H2 }, H2 ctor with dual tokens, H2 accessors |
| include/upstream/h2_connection_table.h, server/h2_connection_table.cc | unique_ptr migration, Extract(UpstreamH2Connection*) |
| include/upstream/proxy_transaction.h, server/proxy_transaction.cc | RESULT_GOAWAY_UNPROCESSED / RESULT_GOAWAY_MAYBE_PROCESSED, IsH2RetryableCode, MapH2CodeToRetryCondition, h2_conn_ + h2_conn_alive_ dual-token capture, H2ConnAlive() accessor |
| include/upstream/upstream_h2_stream.h, include/upstream/upstream_response_sink.h | Sink contract additions for H2 driver-loop coverage |
| include/config/server_config.h, server/config_loader.cc | Config plumbing for the new lifecycle hooks |
| Makefile | host_port_key.h added to UPSTREAM_HEADERS |
| test/h2_upstream_test.h, test/upstream_pool_test.h | ~70 new test cases |
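
To make the upstream_lease.h row concrete, here is a rough sketch of a tagged lease whose H2 accessors gate on two liveness tokens. Everything except the Kind { EMPTY, H1, H2 } tag and the (partition_alive, conn_alive) pairing is an assumption: LeaseSketch, FakeH2Connection, the shared_ptr<atomic<bool>> token type, and the -1 sentinel are hypothetical stand-ins, and the H1 constructor is omitted for brevity.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>

// Stand-in for UpstreamH2Connection; not the real class.
struct FakeH2Connection {
    int id = 0;
};

// Assumed token shape: a shared flag that the owner flips to false on teardown.
using AliveToken = std::shared_ptr<std::atomic<bool>>;

class LeaseSketch {
public:
    enum class Kind { EMPTY, H1, H2 };

    LeaseSketch() = default;  // Kind::EMPTY

    // H2 constructor carrying both tokens, as described in the summary table.
    LeaseSketch(FakeH2Connection* conn, std::int32_t stream_id,
                AliveToken partition_alive, AliveToken conn_alive)
        : kind_(Kind::H2),
          conn_(conn),
          stream_id_(stream_id),
          partition_alive_(std::move(partition_alive)),
          conn_alive_(std::move(conn_alive)) {}

    Kind kind() const { return kind_; }

    // Returns the raw connection only while both owners report alive.
    FakeH2Connection* GetH2Connection() const {
        return BothAlive() ? conn_ : nullptr;
    }

    // Returns -1 (a sentinel chosen for this sketch) when the lease is unusable.
    std::int32_t GetH2StreamId() const {
        return BothAlive() ? stream_id_ : -1;
    }

private:
    bool BothAlive() const {
        return kind_ == Kind::H2 && partition_alive_ && conn_alive_ &&
               partition_alive_->load(std::memory_order_acquire) &&
               conn_alive_->load(std::memory_order_acquire);
    }

    Kind kind_ = Kind::EMPTY;
    FakeH2Connection* conn_ = nullptr;
    std::int32_t stream_id_ = -1;
    AliveToken partition_alive_;
    AliveToken conn_alive_;
};
```

The point of gating on both tokens is that a lease can outlive either the partition or the connection; checking only one would still allow a dangling raw pointer to escape through an accessor.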

Notable correctness work uncovered during review

The gateway-code-reviewer pass surfaced two real bugs that the test suite missed; both fixed in this PR:

  1. OpenNewH2Connection's RegisterOutboundCallbacks throw path leaked outstanding_conns_++ + the parked transport in connecting_conns_. The pre-fix catch handler synchronously recursed into OnH2ConnectHandshakeComplete(CHECKOUT_CONNECT_FAILED), but at that point the H2 shell had not yet been inserted into h2_connecting_conns_, so the disposition handler short-circuited and skipped ExtractFromConnecting. Fix: drop the local shell (so its safety-net dtor doesn't race the synchronous close-callback chain), then ExtractFromConnecting(raw_uc) + DestroyConnection(owned). DestroyConnection's ForceClose fires the close-cb synchronously, the close-cb calls OnH2ConnectHandshakeComplete with shell=null + uc_raw=null and fails queued waiters; DestroyConnection then decrements outstanding_conns_ and tears down the transport.
  2. ClearH2TransportCallbacks could clobber the H1 borrower's callbacks just installed by WirePoolCallbacks. If ~UpstreamH2Connection's safety-net path ran after AdoptAsH1Connection, it would null-assign SetOnMessageCb / SetCloseCb / etc. on the freshly-adopted H1 idle conn. Fix: MarkTransferred() sets transferred_=true BEFORE adoption, and both DestroyOnDispatcher and ~UpstreamH2Connection short-circuit on the flag.
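
The ordering in the second fix can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: TransportSketch, H2ShellSketch, and AdoptAsH1Sketch are hypothetical stand-ins for the transport, UpstreamH2Connection, and AdoptAsH1Connection, and std::function slots stand in for SetOnMessageCb / SetCloseCb.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

// Stand-in for the shared transport whose callback slots change owner.
struct TransportSketch {
    std::function<void()> on_message;
    std::function<void()> on_close;
};

// Stand-in for the H2 shell that temporarily owns the transport callbacks.
class H2ShellSketch {
public:
    explicit H2ShellSketch(TransportSketch* t) : transport_(t) {}

    // Safety-net destructor: clears callbacks unless ownership was transferred.
    ~H2ShellSketch() { ClearTransportCallbacks(); }

    void MarkTransferred() { transferred_ = true; }

    void ClearTransportCallbacks() {
        // Short-circuit: after a transfer the slots belong to the new owner.
        if (transferred_ || transport_ == nullptr) return;
        transport_->on_message = nullptr;
        transport_->on_close = nullptr;
    }

private:
    TransportSketch* transport_;
    bool transferred_ = false;
};

// Hypothetical adoption path: flag first, install the H1 callbacks second,
// destroy the shell last. Without the flag, the shell's destructor would
// wipe the callback installed on the line before it.
inline void AdoptAsH1Sketch(std::unique_ptr<H2ShellSketch> shell,
                            TransportSketch& transport,
                            std::function<void()> h1_on_message) {
    shell->MarkTransferred();
    transport.on_message = std::move(h1_on_message);
    shell.reset();
}
```

A shell that is destroyed without MarkTransferred still clears the slots, so the safety net keeps working for the ordinary teardown path.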

Defense-in-depth additions:

  • AcquireH2Connection now defensively nulls SetConnectCompleteCallback / SetHandshakeCompleteCallback at the H2 session install path (eliminates a class of "closures from a prior owner survive across a promotion boundary" footguns).

Test results

make clean && make -j4 && ./test_runner
Total Tests: 1398 | Passed: 1398 | Failed: 0
Success Rate: 100%

Per-suite:

  • ./test_runner h2_upstream — 128/128
  • ./test_runner upstream — 42/42

CI coverage

./test_runner upstream and ./test_runner h2_upstream were already wired in ci.yml (tsan-rest enumeration + macOS subset) and weekly-valgrind.yml. PR-2 extends existing suites; no new suite registration required.

Out of scope

  • PR-1.5 — Held-fallback retry on truncation. Currently RESULT_TRUNCATED_RESPONSE is terminal (502 BadGateway). Closes r85 criteria 22-24, 26, 30, 57, 59, 70.
  • Phase 2 — Saturation, multi-conn-per-host, predictive preconnect. Separate branch.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1b69c248d7

Comment thread server/pool_partition.cc Outdated
Comment thread server/proxy_transaction.cc Outdated

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a comprehensive overhaul of HTTP/2 connection and stream lifecycle management. Key additions include a "held-fallback" retry mechanism for buffering 5xx responses, improved GOAWAY frame handling to differentiate between processed and unprocessed requests, and a transition to unique_ptr ownership for connections within the H2 table. The implementation utilizes a dual-token pattern (raw pointer combined with an atomic liveness flag) to ensure thread-safe access from asynchronous transport callbacks. One high-severity issue was identified in the DrainPausedBuffer implementation, where a lambda capture lacks a liveness check, creating a potential use-after-free vulnerability if the connection is destroyed before the replay drain listener fires.
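
The kind of liveness check the review asks for might look like the sketch below. It is not the code under review: ConnSketch and MakeDrainListener are hypothetical names, and the check is only race-free under the assumption that the listener and the destructor run on the same dispatcher thread.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <memory>

// Stand-in for a connection that hands out a deferred drain listener.
struct ConnSketch {
    // Shared flag that outlives the object; flipped to false in the destructor.
    std::shared_ptr<std::atomic<bool>> alive =
        std::make_shared<std::atomic<bool>>(true);
    int drained = 0;

    ~ConnSketch() { alive->store(false, std::memory_order_release); }

    // Builds a listener that captures the raw pointer together with the
    // token, and checks the token before touching the pointer.
    std::function<void()> MakeDrainListener() {
        std::shared_ptr<std::atomic<bool>> token = alive;
        ConnSketch* self = this;
        return [self, token] {
            if (!token->load(std::memory_order_acquire)) {
                return;  // connection already destroyed; do not dereference
            }
            ++self->drained;
        };
    }
};
```

Capturing the token by value keeps the flag itself alive, so a listener that fires after the connection is gone reads false and returns instead of dereferencing freed memory.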

Comment thread server/upstream_h2_connection.cc Outdated